In Search of an Entity Resolution OASIS: Optimal Asymptotic Sequential Importance Sampling

Authors

  • Neil G. Marchant
  • Benjamin I. P. Rubinstein
Abstract

Entity resolution (ER) presents unique challenges for evaluation methodology. While crowdsourcing provides a platform to acquire ground truth, sound approaches to sampling must drive labelling efforts. In ER, extreme class imbalance between matching and non-matching records can lead to enormous labelling requirements when seeking statistically consistent estimates of population parameters. This paper addresses this challenge with the OASIS algorithm. OASIS draws samples from a (biased) instrumental distribution, chosen to have optimal asymptotic variance. As new labels are collected, OASIS updates this instrumental distribution via a Bayesian latent variable model of the annotator oracle, to quickly focus on regions providing more information. We prove that the resulting estimates of F-measure, precision, and recall converge to the true population values. Thorough comparisons of sampling methods on a variety of ER datasets demonstrate significant labelling reductions of up to 75% without loss of estimate accuracy.
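
The idea of estimating F-measure from labels drawn under a biased instrumental distribution can be illustrated with a minimal sketch. This is not the OASIS algorithm itself: the instrumental distribution here is fixed rather than adaptively updated, and the function names (`f_measure`, `importance_sample_f`), the weighted F-measure form with parameter `alpha`, and the toy population are all assumptions made for illustration.

```python
import random


def f_measure(preds, labels, weights=None, alpha=0.5):
    """Weighted F-measure: TP / (alpha * predicted_pos + (1 - alpha) * actual_pos).

    With alpha = 0.5 this equals the usual F1 score.
    """
    if weights is None:
        weights = [1.0] * len(preds)
    tp = sum(w * p * l for w, p, l in zip(weights, preds, labels))
    predicted_pos = sum(w * p for w, p in zip(weights, preds))
    actual_pos = sum(w * l for w, l in zip(weights, labels))
    denom = alpha * predicted_pos + (1 - alpha) * actual_pos
    return tp / denom if denom > 0 else float("nan")


def importance_sample_f(preds, oracle, probs, n_labels, rng, alpha=0.5):
    """Estimate the F-measure from n_labels oracle queries.

    Items are drawn with replacement in proportion to `probs` (the
    instrumental distribution). Each draw is reweighted by
    uniform probability / instrumental probability, which corrects
    the bias introduced by oversampling some regions.
    """
    n = len(preds)
    total = sum(probs)
    idx = rng.choices(range(n), weights=probs, k=n_labels)
    weights = [(1.0 / n) / (probs[i] / total) for i in idx]
    sampled_preds = [preds[i] for i in idx]
    sampled_labels = [oracle(i) for i in idx]  # one oracle query per draw
    return f_measure(sampled_preds, sampled_labels, weights, alpha)


# Toy imbalanced population (hypothetical): 2% of pairs are predicted matches.
preds = [1 if i < 200 else 0 for i in range(10_000)]
truth = [1 if (i < 160 or 200 <= i < 240) else 0 for i in range(10_000)]

# Fixed instrumental distribution: half of the sampling mass on predicted matches.
probs = [0.5 / 200 if p == 1 else 0.5 / 9_800 for p in preds]

estimate = importance_sample_f(
    preds, lambda i: truth[i], probs, n_labels=2_000, rng=random.Random(0)
)
print(f"true F1 = {f_measure(preds, truth):.3f}, estimate = {estimate:.3f}")
```

Uniform sampling of the same toy population would spend about 98% of the labelling budget on predicted non-matches, which is the imbalance problem the abstract describes. An adaptive method would additionally revise `probs` as labels arrive.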


Similar resources

Asymptotic properties of the sample mean in adaptive sequential sampling with multiple selection criteria

We extend the method of adaptive two-stage sequential sampling to include designs where more than one criterion is used in deciding on the allocation of additional sampling effort. These criteria, or conditions, can be a measure of the target population, or a measure of some related population. We develop Murthy estimator for the design that is unbiased estimators for t...


An Optimal Approach to Local and Global Text Coherence Evaluation Combining Entity-based, Graph-based and Entropy-based Approaches

Text coherence evaluation has become a vital task in Natural Language Processing subfields such as text summarization, question answering, text generation and machine translation. Existing methods like entity-based and graph-based models deal with how nouns and noun phrases change role across sequential sentences within a short part of a text. They even have limitations in global coheren...


Optimal Capacitor Allocation in Radial Distribution Networks for Annual Costs Minimization Using Hybrid PSO and Sequential Power Loss Index Based Method

In the most recent heuristic methods, the high potential buses for capacitor placement are initially identified and ranked using loss sensitivity factors (LSFs) or power loss index (PLI). These factors or indices help to reduce the search space of the optimization procedure, but they may not always indicate the appropriate placement of capacitors. This paper proposes an efficient approach for t...


Optimal SIR algorithm vs. fully adapted auxiliary particle filter: a non asymptotic analysis

Particle filters (PF) and auxiliary particle filters (APF) are widely used sequential Monte Carlo (SMC) techniques. In this paper we comparatively analyse, from a non asymptotical point of view, the Sampling Importance Resampling (SIR) PF with optimal conditional importance distribution (CID) and the fully adapted APF (FA). We compute the (finite samples) conditional second order moments of Mon...


oASIS: Adaptive Column Sampling for Kernel Matrix Approximation

Computing with large kernel or similarity matrices is essential to many state-of-the-art machine learning techniques in classification, clustering, and dimensionality reduction. The cost of forming and factoring these kernel matrices can become intractable for large datasets. We introduce an adaptive column sampling technique called Accelerated Sequential Incoherence Selection (oASIS) that sa...



Journal:
  • PVLDB

Volume 10

Publication date: 2017